Protein structure prediction by an iterative search method 
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Abstract 

We demonstrate a new algorithm for finding protein conformations tliat minimize a non-bonded energy 
function. The new algorithm, called the difference map, seeks to find an atomic configuration that is simul- 
taneously in two constraint spaces. The first constraint space is the space of atomic configurations that have 
a valid peptide geometry, while the second is the space of configurations that have a non-bonded energy 
below a given target. These two constraint spaces are used to define a deterministic dynamical system, 
whose fixed points produce atomic configurations in the intersection of the two constraint spaces. The rate 
at which the difference map produces low energy protein conformations is compared with that of a con- 
temporary search algorithm, parallel tempering. The results indicate the difference map finds low energy 
protein conformations at a significantly higher rate then parallel tempering. 
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I. INTRODUCTION 



Since the three dimensional structure of a protein largely determines its function, there is 
tremendous incentive to determine the native structure of proteins. For proteins that form high 
quality crystals, x-ray crystallographic methods have long enabled researchers to determine the 
protein's structure. For proteins that fail to form high quality crystals, NMR spectroscopy is a time 
consuming and expensive alternative. Currently, there are many proteins with known sequences 
and unknown structure. As an alternative to x-ray crystallography and NMR spectroscopy, a re- 
liable structure prediction method would be a tremendous asset to biological research. The vari- 
ous structure prediction methods are compared at the semi-annual CASP experiment.- In the past 
CASP experiments, the most successful structure prediction method has been homology modeling. 
In recent years though, ab initio methods have started to become competitive. 

The method of homology modeling is based on the observation that when two proteins have 
similar amino acid sequences, they also usually have similar structural properties.^ Using this 
method, a protein's structure is determined first by comparing its amino acid sequence against 
other determined structures in the Protein Data Bank, and finding similar sequences.^ For exam- 
ple, if a particular subsequence of amino acids almost always forms an alpha helix, then if found 
in the undetermined protein's sequence, the structure of this sub-sequence can be safely guessed. 
In this way, the structure is piece-wise determined, and subsequently assembled."* This technique 
relies heavily on the availability of similar template sequences whose structures have been deter- 
mined. For large classes of proteins, such as membrane proteins, there is a dearth of templates for 
comparison. For such proteins, homology modeling currently offers little promise. 

With ab initio structure prediction, the protein is modeled as a collection of atoms^, or united 
atoms''^, and the native structure is assumed to be the global minimum of an appropriate energy 
function.^ Because the actual energy function navigated by physical proteins is difficult to cal- 
culate precisely, there are numerous approximations in use. Finding the global minimum of the 
energy function is itself a very challenging endeavor,— and many different methods have also been 
developed for this. All of the modem energy minimization algorithms require great computational 
resources, and ab initio methods have been limited to small proteins (approximately fifty amino 
acids). 

In this paper we consider a very simple energy function for the non-bonded interactions (ex- 
plained in appendix A) and propose a new method for finding its global minimum. The proposed 
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method, the difference map (DM)—, has previously been shown to be successful at finding low en- 
ergy states in an off-lattice HP model of proteins .-^^^ The DM operates in a very different manner 
than previous search algorithms used to minimize the conformational energy. Most energy min- 
imization methods are based on a Monte Carlo exploration of the protein conformation's energy 
landscape. For these methods, the "iterate" is an evolving protein conformation. Contrasting this, 
the DM "iterate" is not a protein conformation, but an atomic configuration. Since a polypeptide 
has on average about three degrees of freedom per amino acid, and an atomic configuration has 
three degrees of freedom per atom, the iterate of the DM searches a much larger space than that 
explored by Monte Carlo methods. Searching this larger space is not necessarily a liability: deep 
local minima in the energy landscape that would trap a Monte Carlo iterate are easily escaped by 
the DM, since the DM can evolve the iterate in directions not accessible to the Monte Carlo iterate. 

The DM overcomes three fundamental deficiencies of all Monte Carlo based search techniques. 
First, in all Monte Carlo based searches, an entire folding pathway must be simulated in order to 
find the lowest energy conformation. The DM overcomes this by immediately searching for the 
lowest energy state, without regard to the folding pathway. Second, Monte Carlo search methods 
have a tendency of getting stuck in deep local minima of the energy landscape. There have been 
many modifications to the basic method to overcome this problem^"^'^^. However, though lessened, 
the problem remains to a degree. Contrasting this, the DM escapes even deep local minima in the 
energy landscape, and spends very little time exploring them. And finally, Monte Carlo search 
methods update the iterate by local modifications to the protein conformation, thus limiting the 
rate by which the protein conformation can evolve. The DM typically makes large modifications 
to the configuration in each iteration. Eventually, when the DM encounters a true fixed point, a 
low energy conformation has been found. 

We have applied the DM algorithm to an all-atom protein model (sidechain hydrogens have 
been omitted, though backbone hydrogens are included for the purpose of hydrogen bonding). 
The performance of the DM is compared to that of a popular Monte Carlo method, parallel tem- 
pering (PT). To make the comparison meaningful, the two algorithms are each run on the same 
computer, running the same amount of time. The atomic potential used is as simple as possible, in- 
volving only hydrophobic-hydrophilic interactions, hydrogen bonds, and steric repulsion. Though 
simple, this potential is able to correctly reproduce the general structure of the native fold of the 
staphylococcus aurelius A protein (B domain (10-55) ). In this paper we will refer to this protein 
as "protein A". 
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II. THEORY 



A. Constraints and projections 

The difference map (DM) is an iterative algorithm where the iterate (an atomic configuration) 
is evolved by means of projections onto two constraint spaces. The first constraint space is the 
space of atomic configurations that have a valid peptide geometry. A member of this constraint 
space has all bond lengths, bond angles, and left handed versus right handed orientations correct 
(bond lengths and angles are taken from Engh 1991'^). This is the space of the rotamer configura- 
tions. Most contemporary Monte Carlo searches have the folding protein always a member of this 
constraint space; the energy landscape is usually viewed as an energy function on this space. The 
second constraint space used by the DM is the space of atomic configurations whose non-bonded 
energy is less than a predefined target energy. When freed of the peptide geometry constraint, it 
is easy to find a member of this constraint space. It is clear that when an atomic configuration is 
found that is a member of both constraint spaces simultaneously, the problem has been solved. In 
this case, an atomic configuration that has both a valid peptide geometry, and a sufficiently low 
energy, has been found. The two constraint spaces are described in detail in Appendix A. 

We represent an atomic configuration by R = {Fi, r2, . . . }, where fj is the 3D coordinate 
of atom i. For both constraints, the projection to that constraint space is defined as the closest 
atomic configuration that satisfies the constraint. In this paper, Pq R denotes the projection to 
the peptide geometry constraint, while Pe R denotes the projection to the energy constraint. 

For the geometry constraint, the projection is accomplished by minimizing a penalty function 
(defined in Appendix A) via an adaptive step-size steepest descent algorithm. This projection 
performs a minimal modification to the atomic configuration that yields a member of the geometry 
constraint space. The result of this projection has a valid peptide geometry, but non-bonded atoms 
are allowed to overlap, and in general there is no bias toward a low energy atomic configuration. 

To compute the projection to the energy constraint, the energy function defined in Appendix 
A is minimized until the non-bonded energy is below a predefined target energy. Though the 
result of this projection is a low energy atomic configuration, the configuration in general does 
not have a valid peptide geometry. For a typical member of this constraint space, bond and angle 
constraints are usually not satisfied. While computing this projection, the protein behaves as a 
liquid of independent atoms, rather than as a linked chain of amino acids. 



4 



B. Difference map algorithm 



As a simple pedagogical step toward understanding the DM algorithm, first consider the fol- 
lowing alternative algorithm, called alternating projections (AP): 
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For AP, the iterate is projected to the energy constraint, followed by a projection to the geometry 
constraint. With the projections in this order, the iterate is perpetually a member of the geome- 
try constraint space. This algorithm greedily minimizes the distance between the two constraint 
spaces, and quickly evolves toward a fixed point, where the distance between the two constraint 
spaces has a local minimum. 

To contrast AP and the DM, they are both applied to a 2D example problem in figure [TJ In this 
example, the two constraint spaces are the red and blue curves, DM iteration is shown as green 
dots, and AP iteration is shown as the gold dots. If the initial iterate is close to an actual intersec- 
tion of the constraint spaces (top red dot in figure [T]), then AP will converge to the intersection. 
However, AP is prone to stagnating at places where the distance between the constraint spaces is 
locally minimized (bottom trajectory in figure [T]). Finally, note that the iterate of AP is always a 
member of the blue constraint space. 

The DM was developed to remedy the stagnation problem of the AP algorithm.-^^ The DM 
iterate is updated by R„+i = R„ + d, where 
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Clearly a fixed point has been found when d = 0. If R is a fixed point of the DM, the correspond- 
ing solution (Rsoi) is given by 
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Since Rsoi is simultaneously equal to a projection to each constraint space, it both has the correct 
peptide geometry, and a sufficiently low energy. When the iterate is sufficiently near a fixed point, 
the attractive property of the DM leads to monotonic convergence to the fixed point.— 

The extent to which the native conformation of protein A is an attractive fixed point of the DM 
is shown in figure [21 Here, the initial iterate of the DM was chosen by adding random vectors 
of constant magnitude to protein A's atomic coordinates. The DM then evolved the iterate, and 
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FIG. 1: A 2D example problem contrasts the search dynamics of the DM and alternating projections (AP). 
The two constraint spaces are shown as red (vertical Une), and blue (circles). Two initial points (red dots) 
are iterated via the DM (green dots, black line) and AP (gold dots, gray line). The dashed line is the set 
of fixed points of the DM. Every fixed point of the DM is associated with the unique intersection of the 
constraint spaces. For the top initial point, both search algorithms find the solution. For the bottom initial 
point, AP stagnates at a near intersection of the constraint spaces, while the DM is repelled by this near 
intersection, and eventually finds the actual intersection. The iterate of AP is always a member of the blue 
constraint space. 

terminated when the iterate converged upon a fixed point. This perturbation followed by DM 
iteration was tested 100 times, for many different magnitudes of the perturbation. The average 
number of iterations before a fixed point was found is displayed. Given the same initial iterate, the 
convergence rate of the AP map was tested in the same way, and is also shown. 

Though prone to stagnation, AP is a useful algorithm for finding the nearest local minimum of 
the conventional energy landscape. This energy refinement is done by first projecting the atomic 
configuration to the geometry constraint (yielding a valid protein conformation) and evaluating 
the atomic configuration's non-bonded energy. Next, the atomic configuration is projected to the 
energy constraint with a projection target energy only slightly lower than the current energy (this 
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FIG. 2: The convergence rates of the DM and AP are compared. If the atomic configuration is within 1 A 
RMSD of the native fold, both search algorithms always converge upon the native fold, on average within 
30 iterations. Above 1.25 A RMSD, both algorithms occasionally fail to recognize the nearby intersection 
of the constraint spaces. 

moves the iterate a small step in the downhill gradient direction). Finally, the atomic configuration 
is again projected to the geometry constraint. These alternating projections, with the target energy 
continually being lowered, quickly lowers the energy of the protein conformation, and eventually 
finds a fixed point at a nearby local minimum in the energy landscape. These are the same local 
minima that could potentially trap a Monte Carlo search iterate. 

To generate low energy protein conformations, the DM searched for seven days on four parallel 
processors, each 3 GHz. Each processor operated independently of the others, and every search 
began with a configuration of random atom positions. The initial atom coordinates where chosen 
from inside a box with a uniform probability distribution. 

Every three hundred DM iterations, the current iterate was refined via AP until a nearby fixed 
point of AP was found. The fixed point was the nearest local minimum of the conformational 
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energy landscape, and represented the best estimate for an atomic configuration satisfying both 
the geometry constraint and the energy constraint. The energy and RMSD (all atoms) of these 
estimates are plotted in figure [3] (green dots). 

After refining via AP, if the energy of an estimate was below the target energy of the energy 
constraint, the target energy was lowered to this new lower energy. The iterative DM search was 
then restarted with a new random initial atomic configuration. On the other hand, if after refining 
the energy of the estimate was above the cutoff energy, the DM iterate was replaced with the 
refined estimate, and DM iterations continued. 

C. Parallel tempering algorithm 

The difference map was compared to one of the state-of-the-art minimization algorithms, paral- 
lel tempering (PT).-^^^ PT has had significant success in folding small proteins.-^^ The method is 
a modified Monte Carlo search. For each search, there are several clones of the same initial atomic 
configuration. Each clone is evolved via Monte Carlo steps at a different temperature. At every 
iteration there is a probability of a swap between any two clones (a swap consists in switching 
their temperatures). The probability is a function of the clones' current energies and temperatures. 
Additionally, after a large number of Monte Carlo iterations, the atomic configuration of the lowest 
energy clone replaces the atomic configuration of the hottest clone. 

The Monte Carlo step is computed by adding to each clone's atomic coordinates, random vec- 
tors of a given magnitude. After this perturbation, the atomic configuration is projected to the 
geometry constraint space. The result of these two operations is a slightly perturbed atomic con- 
figuration that has a valid peptide geometry. After the perturbation and projection, the energy of 
the new protein conformation is calculated, and the probability of accepting or rejecting the test 
step is calculated. The magnitude of the random perturbation is adjusted for each temperature to 
maintain a step acceptance rate of 50%. 

Exactly the same computational resources were applied to the PT algorithm. We used four ran- 
dom initial configurations (one on each processor). The initial atomic coordinates were generated 
by first choosing atom positions from inside a box with a uniform probability distribution, and 
then projecting the atomic configuration to the geometry constraint. For each of the four simula- 
tions, we used fifteen clones, whose temperatures ranged from 2.92 to 0.01 (in the energy scale 
described in Appendix A). A temperature of 2.92 was hot enough that the clone with this tempera- 
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ture quickly explored the energy landscape, and spent very little time in any one energy minimum. 
On the other hand, the clone with a temperature of 0.01 was essentially frozen: its energy was 
low, and fluctuated only very little. Each of the four simulations made consistent progress toward 
lower energies. 

With our PT code, we averaged 1.7 seconds per iteration, for a single processor, with fifteen 
clones. This is close to previously published iteration rates. In their 2005 paper,— Schug et. al. 
averaged about one million iterations per fifteen clones using fifteen processors in one day, for a 
protein with five amino acids. This corresponds to about ten seconds per iteration, for a single 
computer, with fifteen clones, iterating a protein with forty amino acids. The fact that our PT 
iterations are faster is due to our comparatively simple potential. 

III. RESULTS 

After one week of searching, both algorithms found many low energy atomic configurations of 
protein A. The RMSD (all atom) from the native fold versus the energy of these folded proteins is 
shown in figure [3l In this figure, the low energy conformations discovered by the DM are shown 
as green dots, and the conformations explored by PT are shown as yellow crosses. As can be 
seen from the figure, the PT simulations are still progressing towards lower energy conformations. 
Previous studies suggest the PT simulations will find the global minimum of the energy landscape 
when given enough time. 

The lowest energy conformation the DM found had little resemblance to the native fold. The 
conformation with the lowest RMSD (4.4A) found by the DM had an energy of -71.1, compared 
to the energy -66.6 of the native fold. Both protein conformations are shown in figurelH We do not 
claim that the lowest energy protein conformation found by the DM is the lowest possible, only 
that given the same amount of time and resources, the DM finds many more low energy states 
than PT, as is evident from figure [3l The existence of folds with energy lower than the native fold 
points out a deficiency of our minimalistic potential. Here we are demonstrating the effectiveness 
of a new search algorithm toward the purpose of finding low energy states, rather than validating 
a candidate potential. 

We tested several other proteins, and a range of potential parameters, and found the DM almost 
always finds a lower energy state than the native conformation. This is shown in table HI To adjust 
the energy function, the relative scale of the hydrophobic energy to the hydrogen bonding energy 
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FIG. 3: Here RMSD (all atom) versus energy is plotted for the results from both search algorithms. The 
green dots are the output configurations found by the DM, while the gold crosses are those found by PT. 
Clearly the PT simulations are still progressing toward lower energy. Both methods ran one week. The blue 
dot is the most native-like fold discovered, and the protein conformation is shown in figure |4l The red dot 
is the native fold, also shown in figure |4l 

was varied. In tableHl the parameter Fh b represents the fraction of the total energy due to hydrogen 
bonding in the native fold, and was adjusted by changing the pref actor of the hydrophobic term in 
the energy function. 

IV. DISCUSSION 

As can be seen in figure [3l the PT data points form an almost continuous trajectory. There were 
four different PT simulations, and there can be seen four yellow streaks, each occasionally broken 
by a discontinuity. By being constrainted to move in the usual energy landscape, PT is forced 
to make small modifications to the protein conformation. Because of this, the PT simulations 
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FIG. 4: The most native-like protein conformation (blue) found by the DM is shown compared to the native 
fold (red) of protein A. This protein conformation was found during one week of computation. It has an 
RMSD (all atom) of 4.4 A, and an energy of -71.1. The native fold has an energy of -66.6. 

in effect reconstruct the folding pathway. If the goal is to find the lowest energy conformation, 
however, then simulating the entire folding pathway is unnecessary. While the PT algorithm was 
simulating the folding pathway of the protein, the DM was searching for low energy conformations 
directly, with no regard for the physical pathway. This accounts, to a large extent, for the superior 
performance of the DM algorithm. 

The most native-like protein conformation produced by the DM is shown in figure IH This 
conformation, like the native fold, has three helices. The DM also found many lower energy 
folds — a defect of our minimalistic potential. Our computationally simple potential nevertheless 
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TABLE I: The performance of the DM on seven proteins, and for a range of energy parameters. The fraction 
of the native conformation's energy due to hydrogen bonding is Fhb- This is adjusted by varying the pref- 
actor of the hydrophobic term in the energy function. For every choice of Fhb, the native conformation's 
energy is given along with the best fold discovered by the DM. 



PDB code (length) 


Fhb 


native 


lowest found 


search time (3GHz) 




0.86 


-33.2 


-34.8 


45 hours 


Ibba (30) 


0.69 


-50.4 


-56.0 


9 hours 




0.5 


-56.7 


-60.3 


35 hours 




0.85 


-71.4 


-80.0 


45 hours 


lenh (53) 


0.62 


-105.2 


-112.5 


9 hours 




0.38 


-125.5 


-126.6 


35 hours 




0.89 


-61.2 


-66.7 


45 hours 


Igab (45) 


0.72 


-88.6 


-89.7 


9 hours 




0.53 


-101.0 


-102.9 


35 hours 




0.89 


-64.1 


-66.3 


45 hours 


Igis (45) 


0.75 


-92.1 


-91.2 


9 hours 




0.59 


-106.4 


-109.2 


35 hours 




0.87 


-57.8 


-65.1 


45 hours 


Iguu (44) 


0.70 


-85.3 


-86.4 


9 hours 




0.48 


-102.9 


-103.7 


35 hours 




0.84 


-43.6 


-50.1 


45 hours 


Ivii (35) 


0.60 


-67.1 


-71.0 


9 hours 




0.33 


-82.7 


-86.7 


35 hours 




0.87 


-62.6 


-65.6 


45 hours 


lba5 (46) 


0.67 


-95.4 


-94.0 


9 hours 




0.43 


-111.0 


-116.4 


35 hours 



enables us to show the effectiveness of the DM search algorithm, as compared to PT. Several 
choices of potential parameters were explored in the process of trying to find a potential with the 
property that the native conformation is the lowest energy fold, but this proved to be impossible 
for the eight proteins studied. We believe the superior performance of the DM algorithm over PT 
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will extend to more realistic potentials as well. 

In this paper we demonstrate a new search algorithm, based not on the physical pathway of the 
folding process, but on the geometry of constraint spaces. Our results show the difference map 
algorithm is very efficient for finding low energy states for a given potential. The algorithm is 
both easy to implement, and is easy to run in parallel. It is our hope that this new method for 
finding low energy atomic configurations will facilitate the development of more precise atomic 
potentials, since the most important feature of a good potential is that the native fold has the lowest 
energy. 
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APPENDIX A 

1. Geometry constraint 

The geometry constraint ensures all the bond lengths and angles are correct (data from Engh 
1991 -), all the peptide bonds are in the trans orientation, and the Ramachandran angles are not 
in the sterically forbidden region of the Ramachandran plot. An atomic configuration satisfying 
these conditions is a valid protein conformation (rotamer). We use the following penalty function 
to implement the geometry constraint: 



For the first term of equation (|AH) . the index i runs over all the bonds in the protein. For each 
bond, Bi = Pi (vj ■ Vj — fef)^ Each bond has a target length, hi, a penalty weight pi, and Vj is the 
vector connecting the two atoms participating in the bond. Essentially, each Bi is a measure of 
how correct the bond is, and pi is the relative cost of the z"^ bond deviating from correct. For 
all backbone bonds (such as Cq,-C, C-N, or N-Cq, ) Pi is 4, for the bonds coming directly off the 
backbone (such as C-0, N-H, or Ca-C/s ) Pi is 2, and the bonds within a sidegroup are given a 
penalty weight of 1 . 

In the second term of equation (|A1|) . the index i runs over all the bond angles in the protein. 
The angle is defined by two vectors, Vj_i and Vj 2, and A = pi (vj_i ■ Vj 2 — Oj)^- Every angle 
has a target dot product for the two vectors, aj, and a penalty weight pi. Note that the magnitudes 
of Vj 1 and Vj 2 are each controlled by the bond constraint above. Like Bi above, Ai is a measure 
of how correct the z"^ angle is, and pi is the relative cost of the z"^ angle deviating from correct. 
For all backbone angles (such as Cq-C-N, C-N-Cq,, N-Cq-C) pi is 2, for angles involving bonds 
directly off the background (such as 0-C-C„, 0-C-N, H-N-C„, H-N-C, Cfj-Ca-C, and C/j-C^-N) 
Pi is 1, and for all angles within a sidegroup pi is 0.5. 

For the third term of equation (lAll) . the index i runs over backbone peptide bonds. These f2j 




(Al) 
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tenns ensure that backbone hydrogens and oxygens are in the trans configuration, and that the 
atoms H, N, C, and O all lie in a plane. Figure [5] shows the correct trans orientation of the atoms 
participating in a peptide bond, and will aid in visualizing the vectors described below. To ensure 
the trans configuration, terms i7j = pi (vj i ■ Vj 3 — do)^ are included. Here, Vj 1 is the C-0 vector 
and Vj_3 is the N-H vector for the i^^ peptide bond, and cIq is the correct dot product of these two 
vectors. The penalty weight pi is 1.5. To ensure the atoms H, N, C, and O all lie in a plane, 
terms Qi = pi (vj 1 ■ (vi,2 x Vj^s))^ are included. Here, Vj i and Vj,3 have the same meaning as 
before, Vj^2 is the C-N peptide bond vector, and the penalty weight pi is 0.2 . To ensure the trans 
configuration of consecutive Cq,'s, and to ensure that the four atoms Cq,, C, N, and Cq, all lie in 
a plane, there are identical constraints involving these four atoms. Note that this term to ensure 
planarity is redundant. The relevant angle constraints would, on their own, ensure the atoms lie 
in a plane. However, this additional, more direct planar constraint makes the overall projection to 
this constraint set converge faster. 




FIG. 5: A schematic of the atoms involved in the peptide bond. The atoms H, N, C, and O all lie in a plane. 
The H and O are in the trans configuration. 

The fourth term of equation (lAll) is for four-atom configurations where left-handed ver- 
sus right-handed orientations are relevant. For every amino acid (except glycine) the four 
atoms C/3, Cq,, C, and N define a parallelepiped, whose volume is constrained by A = 
Pi ■ (Vj^2 X Vj.s) — Vof, where Vj^i is the Ca-Cp vector, Vj_2 is the Cq-C vector , Vj^a is the 
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Cq-N vector, Vq is the target parallelepiped volume (note the sign of Vq dictates left handed versus 
right handed orientations), and the penalty weight pi is 1 . Also, for sidegroups that have a left 
handed versus right handed preference, such as the side group of isoleucine, there is an identical 
constraint relating the orientation of the relevant atoms. 

The last term of equation (lAll) controls the range of the Ramachandran angles and 'ip. Due 
to steric repulsion, there is a significant region of the Ramachandran plot that is inaccessible to 
proteins, shown in figure [6l Rather than calculating 0, it is easier to calculate the sine and the 
cosine of by: 

sin(p = \2 ■ (vs X vi) 





V2 




Vi X \2 




V2 X V3I 



COS = [(Vi ■ V2) (V2 ■ V3) - (Vi ■ V3) (V2 ■ V2)] 7^ ^-r-p ^ 

|Vi X V2I |V2 X V3I 

where Vi, V2, and V3 are the three vectors defining the torsion angle (p. Specifically, Vi is the C-N 
vector, V2 is the N-C^ vector, and V3 is the C^-C vector (see figure [S]). Similarly, the sine and the 
cosine of tp are calculated in the same way, except Vi, V2, and V3 are the three vectors defining the 
torsion angle tp, or Vi is the N-C^ vector, V2 is the Cq-C vector, and V3 is the C-N vector. 

In terms of (p and tp, the Ri in equation (lAll) is the sum of a function controlling the (p distribu- 
tion and the tp distribution. Specifically, 



Ri 



if 


/(0)< 


-0.26 


if 


/(0)> 


-0.26 


if 


g{tP) > 


-0.50 


if 


gW < 


-0.50 



where f{(p) = 0.83 sin (/) + 0.17 coscp 
and g{tp) = 0.56 sin -0 + 0.44 cos'0. 

Ri is shown by the shading in figure [6l The region defined by R^ > crudely approximates the 
region of the Ramachandran plot inaccessible due to steric repulsion, also shown in figure [6l 

A member of the geometry constraint has each of the penalty functions above equal to zero. For 
the projection to this constraint space, all five terms of equation (|A1|) are minimized by an adaptive 
step-size steepest descent algorithm, and are considered sufficiently close to zero when the total 
penalty is less than 0.001 per amino acid. The various energy weights pi used above were chosen 
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FIG. 6: The natural distribution of Ramachandran angles (blue area) is shown along with a greyscale of the 
Ramachandran penalty used in equation (lAll) . The distribution data is taken from Kleywegt 1996.— 98% 
of all non-glycine Ramachandran angles lie in the blue shaded area. 

such that this minimization is computed efficiently, and never frustrated. After the minimization, 
all of the bonds have the proper length, all of the angles are correct, all of the peptide bonds are 
planar and in the trans configuration, and all of the Ramachandran angles are in the allowed region 
of figure [6l However, an atomic configuration satisfying the geometry constraint may have non- 
bonded atoms overlapping. It is the fact that atoms are allowed to pass through each other that 
makes the minimization of the penalty function always successful, and never frustrated. 

2. Energy constraint 

The energy constraint set is defined as the set of atomic configurations with a non-bonded 
energy below a given target, Eq. The constraint space is thus dependent on the energy target 
Eq. An atomic configuration satisfying this condition does not necessarily have a valid peptide 
geometry, indeed bonded atoms may be quite separated, and the atomic configuration may not 
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TABLE II: Hydrophobicity values A negative number means the two atom types attract each other, positive 
numbers indicate repulsion. These parameters are based on those found in a previous study^^ , and have been 
adapted to our functional form of HPjj . 





Aliphatic Carbon 


Aromatic Carbon 


Polar 


Sulfur 


Aliphatic Carbon 


-0.108 


-0.075 


0.072 


-0.063 


Aromatic Carbon 


-0.075 


-0.081 


0.093 


-0.048 


Polar 


0.072 


0.093 


0.126 


0.084 


Sulfur 


-0.063 


-0.048 


0.084 


-0.126 



even resemble a polypeptide. In detail, this constraint is determined as follows: 

eo>enb= y1 Y1 Y1 

jj'Gatoms ij'gatoms iGH, j'GO 

The first term of equation (|A2|) prevents atoms from overlapping. Specifically, 

{0 if Tij > ro 

(l-(^f)y if ^o-" 

where r^j is the distance between the and j*'* atoms, and tq is the distance at which the atoms 
start overlapping. The tq used is a sum of the atomic van der Waals radii of the the and j*'' 
atoms. The following radii are used: 1.57A for aliphatic sidegroup carbon, 1.41A for aromatic 
sidegroup carbon, 1.44A for backbone carbon, 1.34A for nitrogens, 1.20A for oxygens, 0.65A for 
hydrogens, and l.SVA for sulfur. These radii are based on data from a previous study,^^ and have 
been adapted to our VEjj functional form. Note VEjj goes to 1 at an atomic separation of r^j = 0, 
and smoothly goes to zero at rij = tq. Atoms that are bonded together must be treated specially, 
since they must be allowed to come closer together than non-bonded atoms. For these pairs, tq is 
40% of the sum of constituent atomic van der Waals radii. 

The second term of equation (|A2I) simulates the entropic interaction of water with various 
sidegroup atoms. Two atoms only interact via this hydrophobic energy if they are both sidegroup 
atoms (C/3, C^, etc.), and they belong to different amino acids. The functional form of HPjj is. 




if rij > ro 
if Tij < ro 
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where ro is calculated the same as above, and Eij is the interaction energy depending on the 
participating atom types. The energies used are shown in table |Ill Since hydrophobic atoms, in 
this model, attract each other, they tend to form a well defined oily core, while polar atoms repel 
every atom type, and thereby tend to inhabit the surface of the protein. 

The final energy of equation (IA2I) represents hydrogen bonding. The indices i and j run over 
all the backbone hydrogen atoms and all the backbone oxygen atoms respectively. The functional 
form of HBjj is the product of a distance dependent function /(rjj) and a function of the two 
angles g{9a, 9b) formed by the C-O-H angle 9a, and the 0-H-N angle 9), (see figure|7]). This energy 
function is essentially a rescaling of the one used by Irback et. al. 2000.^ In terms of the distance 
function/ (rjj) and the angular function g{9a, 9),), HBjj is. 



where 



9{9a,9b 



finj) g{9a,9b) 



-1.5 ( 2 



if > ro 

-1.5 if rij < ro 

cos^ 9a cos^ 9h if cos 9a > and cos 9i, > 
otherwise 



The target hydrogen bond distance, ro, is 1.9A, and the distance between the 2* hydrogen and the 
oxygen is rij. Also, two backbone atoms can form hydrogen bonds only if they are separated 
by at least two other amino acids. 




FIG. 7: The definition of 9a and 9^ for hydrogen bonding. The dotted line is the hydrogen bond. 
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The three energies in equation (|A2h . in total, constitute the non-bonded energy of an atomic 
configuration. The energy constraint space is defined as the set of atomic configurations whose 
non-bonded energy is less than a predefined target energy, Eq. The projection to this constraint 
space is done by minimizing the total energy with an adaptive step- size steepest descent algorithm 
until the total energy is equal to Eq. 

Note finally that an atomic configuration whose non-bonded energy is less than Eq (and there- 
fore a member of this constraint space) does not necessarily have a valid peptide geometry. For 
example, it is possible to have two bonded atoms separated by large distances, and the atomic 
configuration can still be a member of this constraint space. The energy functions above treat the 
atomic configuration as a collection of independent atoms, rather than a linked chain. 
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